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ABSTRACT 

The production of artificial light curves with known statistical and variability proper- 
ties is of great importance in astrophysics. Consolidating the confidence levels during 
cross-correlation studies, understanding the artefacts induced by sampling irregulari- 
ties, establishing detection limits for future observatories are just some of the appli- 
cations of simulated data sets. Currently, the widely used methodology of amplitude 
and phase randomisation is able to produce artificial light curves which have a given 
underlying power spectral density (PSD) but which are strictly Gaussian distributed. 
This restriction is a significant limitation, since the majority of the light curves e.g. 
active galactic nuclei, X-ray binaries, gamma-ray bursts show strong deviations from 
Gaussianity exhibiting 'burst-like' events in their light curves yielding long-tailed prob- 
ability distribution functions (PDFs). In this study we propose a simple method which 
is able to precisely reproduce light curves which match both the PSD and the PDF 
of either an observed light curve or a theoretical model. The PDF can be representa- 
tive of either the parent distribution or the actual distribution of the observed data, 
depending on the study to be conducted for a given source. The final artificial light 
curves contain all of the statistical and variability properties of the observed source 
or theoretical model i.e. same PDF and PSD, respectively. Within the framework of 
Reproducible Research, the code, together with the illustrative example used in this 
manuscript, are both made publicly available in the form of an interactive mathe- 
MATICA notebook. 

Key words: methods: statistical - methods: numerical - galaxies: nuclei - galax- 
ies: active - X-rays: galaxies - gamma-rays: galaxies - X-rays: binaries - galaxies: 
individual: NGC4051, 3C 454.3 - X-rays: individual: CygX-1 
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Currently in astrophysics, artificial lig ht curves are usu- 
ally c onstructed using the procedure of lTimmer fc Koenid 
l| 19951 ) (hearafter TK95). This method is able to produce 
ensembles of non-deterministic, normally distributed time 
series from a given underlying power spectral density (PSD) 
model, &(f), which represents the variability power as a 
function of temporal frequency, / . Resembling the method 
proposed bv lDavies fc Ha rtc (1987), it randomises correctly 
both the phase and the amplitude of the Fou rier compo- 
nents, thus advancing on the previous method of lDone et al.l 
(1992), which randomises only the phase, assuming a deter- 
ministic amplitude which causes a long-term trend in the 
resulting simulated data sets. 

There are numerous applications of the TK95 procedure 
in a plethora of astrophysical fields, including: 
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• Its use in the establishment of statistic al confidence 
intervals during cross-cor relation studies (e.g. lAgudo et al.l 
l201ll : iBartlett et afll2013h . 

Its appl i cation during microlensing studies (e.g 



Ofek fc Maol 120031 : iKoptelova et alj 120101 ; iTewes et all 
20121) . 

The detection of variability in large catalogues and sur 



veys jBauer et al.ll2009l: |V illforth et al.ll2010l : lMacLeod et all 



l20ld :lP rimini et al.ll201ll ). as well as for the detection of the 
smallest variability tim e-scales embedded in a data set (e.g. 
lAharonian et al.ll2007t) . 

• Determination of the detection limits of astronomical 
instruments for a given type of astrophysical source (e.g . 
AGN, GRBs) (e . g. lGreene et all I20I0I; iPrimini et al J 1201 J : 
iDoro et al. I l2013l : iKhabibullin et al1l2012l ). 

• Its use in the derivation of confidence intervals during 
the study of quasi-per iodic oscillations of galactic and extra- 
galac t ic objects (e.g . iBenlloch et al.l 120011 ; [Gicrlihsk i et al.l 
120081 : IDo et alll2009l ). as well as for the statistical charac- 
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terisation of pe riodic and pulsed pat t erns in stella r pho- 
tometric data ( Stanishev et alj |2002| ; iGrosso et al.l l20ld: 
iBlomme et al.ll201ll ). 

• Its central role in the estimation of the underlying PSD 
of irregularly sa mpled AGN li g ht cu rves within the proce- 
dure proposed by lUttlev et all l|2002h . 

• Its vital use in the study of both th e powers and limi- 
tations of a giv e n statistical method (e.g.lZhang et al.l 20041; 



ng 

Vaughanl 120051 ; lEmmanoulopoulos et alj 201p| ; 
201ll ). 
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Its use i n the fields of Solar astrophysics 
iaguru et alj I2004T ) and geophysics (|Venema et al.l 
l2006h . 

At this point, it is very important to note that the above- 
mentioned method of TK95 is appropriate for the produc- 
tion of Gaussian artificial light curves only. This means that 
the resultant surrogate data set£| preserve only the first two 
statistical moments of the original data set, i.e. the mean 
value, fx, and the variance, a 2 , ignoring potential higher or- 
der statistical moments, such as skewness, kurtosis, or multi- 
modes found in the normalised flux distribution of the data, 
i.e. probability density function (PDF) , corresponding either 
to the parent or observed distribution. Thus, Gaussian light 
curves show on average the same amplitude variations above 
and below the mean, resulting in a zero skewness distribu- 
tion of data points. Another characteristic is that since the 
surrogates are normally distributed, there is always a finite 
probability for the artificial data points to become negativfl 
However, the light curve of any astronomical source must by 
default remain positive and so must the resulting PDF. 

In this framework, TK95 methodology can be used 
to simulate observed data sets which are Gaussian dis- 
tributed in the broad sense i.e. having a negligible skewness 
and/or kurtosis. On a large number of occasions however, 
light curves exhibit a 'burst'-like beha viour e.g. in X-rays 
with RXTE and X MM-Newton (e.g. IChitnis et ail |2009| ; 
IVaughan et~al]|201 ll) . in 7-rays wit h the Large Area Tele- 
scope (LAT) on-b oard Fermi (e.g. IChatteriee et all 120091 ; 
lAgudo et al1l201ll ). in which the events are distributed fol- 
lowing right heavy-tailed distributions. This implies occur- 
rence probabilities of high flux values larger than those ex- 
pected from Gaussian distributions and thus the Gaussian 
TK95 products can not be used for the establishment of 
confidence intervals e.g. in cross-correlation studies. 

Furthermore, the rms-flux relation, i.e. the linear scaling 
of the fractional root mean square (rms) variability ampli- 
tude with the flux, observed in both AGN and X-ray bi- 
naries ( XRBs) duttlev fc McHardvll200ll ; lUttlev et al. 2005; 
lGandhi|[2009l ; lMcHardy||2010l). c annot be reproduced by the 
TK95 algorithm. lUttlev et all (120051 ) show that such be- 
haviour can arise from a non-linear multiplicative variability 
process in which the parent distribution follows a log-normal 
distribution. The authors therefore suggest a modification to 
the TK95 products involving exponentiation (in base e) of 
the normally distributed artificial data sets, yielding light 



The term 'surrog ate' data set (after iTheiler et al.lll992l) will be 
used throughout the manuscript in exactly the same way as the 
terms 'artificial', 'simulated' and 'synthetic' data set. 
2 For a given point this probability is equal to 0.5 erfc[/i/(cr\/2)] 
where erfc denotes the complementary error function. 



curves which both possess a log-normal distribution and ex- 
hibit the rms-flux relation. Although the normalisation of 
the input PSD, descale (/)) is selected in such a way that the 
variance of the final products matches that of the observed 
light curve, the actual shape of the PSD is distorted from 
the original one, £P(f)6f, in such a way that the actual vari- 
ability power within a given frequency range, ^ r cscaio(/)5/, 
differs from the genuine one, £?(f)8f. 

Apart from this PSD distortion, the exponentiation 
transformation cannot be generalised to arbitrary parent 
or observed distributions (depending on the type of study). 
The specification of the parent distribution requires either 
very large data set (e.g. for the case of CygX-1 253144 
data poin ts are required to f orm the parent log-normal dis- 
tribution lUttley et al.1 [2005) or a theoretical model (e.g. 
iKellv et al.ll201ll ). Employment of the parent distribution 
is of vital importance in comparing variability properties of 
data sets obtained over a long period of time which map 
the complete variability behaviour of the source. Neverthe- 
less, it is sometimes crucial to establish the detection sig- 
nificance of a given result coming from a single observed 
data set. This approach has been used several times in the 
field of reverberation studies, in the f orm of flux redistribu - 
tion or random subset selection (e.g. IPeterson et al.lll998h . 
or the detection of time lags in very high energy (VHE) 
Che renkov astronomy in th e framework of Quantum Grav- 
ity l|Aharonian et al" I2008T I . For the case of transient phe- 
nomena in particular, e.g. GRBs, in which the concept of a 
parent distribution is not applicable, only a single realisa- 
tion is available for each observation and thus this should 
be used as the PDF. 

In this paper we put forward a simple method 
which combines the routine of TK95 and the itera- 
tive amplitude adju s ted F ourier transform algorithm of 
ISchreiber fc Schmit j i| 19961 ) (hearafter SS96), which pro- 
duces artificial light curves which possess exactly the same 
PSD and PDF as the originally observed light curve (or a 
theoretical model). Thus, the surrogates will have exactly 
the same variability and statistical properties as the ob- 
served light curve. Initially, in Sect. [5] we describe in de- 
tail the method. For illustrative purposes, in Sect. [3] we 
then apply it to the case of the well-studied type-I Seyfert 
AGN NGC4051, using XMM-Newton observations. Follow- 
ing that, in Sect. [4] we produce artificial light curves for the 
7-ray blazar 3C 454.3, using Fermi-LAT observations, and 
the XRB CygX-1 using observation obtained by the All Sky 
Monitor (ASM), on-board RXTE. In Sect. [5] we present an 
application in cross correlation analysis, and in Sect. [6] we 
reproduce the rms-flux relation for the case of log-normally 
distributed light curves. In Sect. [7] we discuss which proper- 
ties of the light curves are preserved during our simulation 
process and finally, a discussion together with a summary 
of our results, can be found in Sect. [8] In the Appendix |A"1 
we give the basic definitions and properties of the various 
quantities i.e. periodogram, PSD and PDF, as well as the 
various fitting procedures that will be used throughout this 
manuscript. In Appendix [B] we elucidate the differences be- 
tween statistical moments and cumulants, which are com- 
monly confused in the astronomical literature. 

Throughout the manuscript the error estimates for the 
various best-fitting model parameters correspond to the 90 
per cent confidence intervals unless otherwise stated. The 
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error bars of the plot points in all the figures indicate the 
68.3 per cent confidence intervals. 



2 METHODOLOGY 
2.1 The algorithm 

This method is a combination of TK95 and the SS96, with 
some significant alterations and modifications which join the 
two together. 

Consider an observed light curve x b s (t) consisting 
of N uniformly sampled observations (sampling rate At), 
{ti, x bs(ti)} for i = 1,2, ...,N. The light curve has an 
underlying PSD, and an observed (or 'parent', de- 

pending on the purpose of the statistical study) PDF, 
PDF [0 sC x ohB (t) < oo]. Note that both/either PSD and/or 
PDF can also originate from a theoretical model which we 
want to check the statistical properties of its products i.e. 
time series. Note that if one wishes to take into account the 
various spectral distortion effects (Sect. 1273")) then one should 
adjust both the simulation length N and the time resolution 
accordingly as described in Sect. 12.31 

(i) Using the TK95 procedure, a normally distributed 
time serie^j is produced, x noTln (t), consisting of N values 
and an underlying PSD identical to <^(/). Then, for each 
Fourier frequency, fj, the discrete Fourier transform (DFT), 
DFT nolln (j), is estimated and from this the corresponding 
amplitudes, &£ niat m.(J), phases, </wm(i), and periodogram, 
-Pnorm(/j) (equations IA2I IA3I and IA41 respectively). Note 
that since the iteration algorithm aims to produce artificial 
source light curves, the input PSD should not contain the 
Poisson noise component. 

(ii) From the PDF [0 ^ x b s (t) < oo] a series of pseudo- 
random numbers is produced which forms a white noise data 
set, x s jm,i(f). Then, at each Fourier frequency, the discrete 
Fourier transform of £sim,i(t) is estimated, DFT B i m ,\Q), and 
from that the corresponding amplitudes, f/ s im,i(j), phases, 
0sim,i(i), fj and periodogram, Psim,i(/j fl 

(iii) Spectral adjustment: For each frequency, fj, the 
amplitudes #/ s im,l(j) are replaced with the amplitudes 
^norm(i), whilst keeping the phases </> S im,i(i) unaltered. This 
yields the adjusted DFT of :r sim ,i(t), MTsim.adjust.iC?), on 
which we then perform an inverse discrete Fourier transform 
(IDFT), yielding the time series, £sim.adjust,i(t)- This time 
series has an identical underlying PSD to the desired one, 
!P(f), but with a distribution of measurements which has 
been altered from that of PDF [0 ^ x b B (t) < oo]. 

(iv) Amplitude adjustment: A new time series is cre- 
ated from the values of K s jm,i(^) ordered based on the 
ranking of Zsim. adjust, This means that the the highest 
value of a; s im. adjust, i(t) is replaced by the highest value of 

3 Actually this is an asymptomatically normally distributed time 
series. Despite the fact that TK95 corresponds to a realisation of 
a Gaussian process, individual artificial data sets products (of 
finite length) may not be necessarily normally distributed since 
the Gaussianity of the process limits the asymptotic distribution 
only, N — > oo. 

4 For £sim,l(£) the periodogram, P s im,i(/)i corresponds by de- 
fault to an underlying PSD with a slope of a = (since it repre- 
sents a white noise process). 



x s im,i(t), the second highest value of a; s im.adjust,i(t) is re- 
placed by the second highest value of x B i m> i(t) and so on. 
The resulting data train, x S i m ,2(t), is distributed exactly as 
PDF [0 < a; bs(t) < oo] but its PSD differs from the target 
one, 3?{f). 

(v) The same process is repeated in an iterative fashion 
K-times, starting from step (ii), until the resulting products 
remain the same i.e. £ s i m ,k+i(i) = %sim,k(t) (convergence): 

• 2 nd iteration: a; 8 im,i(t) is replaced by a; s i m ,2(t) 

• 3 rd iteration: x B i m ^(t) is replaced by x B i m ,3(t) 

• K, th iteration: x s im, K -i(t) is replaced by x s im, K (t) 

After a given number of iterations, e.g. A = K + 1, the 
synthetic light curve products do not change (i.e. conver- 
gence) and thus the a; s im,A(^) iterated product comprises the 
final artificial light curve product. The exact number of it- 
erations depends on the length of the original data set, the 
underlying input PSD and the input PDF. More about the 
convergence can be found in Sect. [3701 and Sect. [372721 us- 
ing Monte Carlo simulations. Note that for the case of a 
Gaussian PDF the iteration process gives exactly equivalent 
results with the TK95 products since (step i) yields prod- 
ucts which are already Gaussian distributed. The flowchart 
of the abovementioned method is given in Fig. [T] 

2.2 Appropriate treatment of the Poisson noise 

The final simulated product, K s im,x(t), has the desired 
distribution and PSD corresponding to the source light 
curve produced. Since the observed light curve is a prod- 
uct of a counting detector-process the observations are af- 
fected by Poisson noise, which is imprinted in the cor- 
responding PSD as a constant component. In order to 
mimic this effect, each light curve point £sim,A(i) = 
{xaim,\(ti),Xsim,x(t2), ■ ■ ■ ,x S i m ,x(tN)} is replaced by an ap- 
propriate Poisson random variate 

Pois[n = x sim , x (U)At] . 

£sim,Pois,A(£i) -^T for I = 1, . . . , N (1) 

where Pois[x s i m ,x(ti)At] dictates the probability mass func- 
tion of the Poisson distribution with a mean value of 
x silrli x(ti)At. 

2.3 Spectral distortions: Red noise leak and 
aliasing. 

In the case of a non-white noise PSD (as in the case of AGN 
light curves) the periodogram estimates tend to be biased 
due to 'red noise leak' (the transfer of variability power from 
the low to high frequencies due to the finite len gth of ob- 
servations; |Deeter_&_13o3mtons 1982; Dcctcr 1984) and alias- 
ing effects (fold-back of variability power from high frequen- 
cies to lower f requencies due to the finite time resolution; 
lKirchner|[2005l V 

These are two very well understood spectral distor- 
tions induce by the sampling properties of the data set and 
they can be taken into account in the usual manner (e.g. 
lUttlev et al]|2002» . In order to take into account the 'red 
noise leak' effect we produce surrogate data sets which are 
much longer than the observed data set (e.g. 100 times) and 
then we randomly select a subset having the desired length. 

With respect to the aliasing, since we are dealing 
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Figure X. The flow chart diagram showing the various steps of 
the method. 



with data averaged over time intervals, A£ samp i e , rather 
than simply sampl ed data its effect is very much reduced 
|van der Klidll988r ). Nevertheless, if one also wishes to in- 
clude the aliasing effect in the simulations, for the case of 
unbinned data (e.g. count data), then one should increase 
the time resolution of the simulations e.g. to 10 per cent of 

A^sample • 

With these two approaches, the dependencies in both 
the Fourier amplitudes (equation lA2[) and the Fourier phases 
(equation I A3|) are taken into consideration. However, if one 
wishes to carry out statistical studies that deal only with 
the Fourier amplitudes (as in all the examples shown in this 
manuscript) then the length and binning adjustments can be 
applied during the first step of the abovementioned method 
i.e. during the application of the TK95 procedure. In this 
way the TK95 artificial light curves, :r norm (i), carry all the 
spectral distortion effects that involve only the Fourier am- 
plitudes, which will then be passed to the final surrogates 



during the spectral adjustment stage (step iii). An major ad- 
vantage to this is that the whole process is much faster since 
the iteration process involves data sets which have lengths 
equal to that of the observed data set. 

Note once again that for statistical studies that involve 
Fourier phases, e.g. phase-lag spectra studies, one should 
follow the initial recipe i.e. carry out the whole simulation 
for longer and more finely sampled surrogate data sets, and 
select a subset which has the desired length from the final 
converged iteration product. The effect of red-noise leak on 
the phases is rarely discussed in the literature despite the 
fact that it adds significant dependencies to the phases. 

For demonstrative purposes, we create an artificial light 
curve, using the TK95 procedure, consisting of 10 6 data 
points, binned in 100 s, with an input PSD which has a 
power-law model of slope -2.5. We then chop the light curve 
into 1000 segments each one consisting of 1000 consecutive 
points. For each data set and for each Fourier frequency, fj 
we estimate the DFT (equation IA1|I and from this its am- 
plitude (via the periodogram, equation I A4f) and its phase 
(equation I A3|l . Finally, for each fj, we average the the peri- 
odogram estimates and phases; the results are shown in the 
top panels of Fig. [2] 

The top-left panel of Fig. [2] shows clearly the effect of 
red noise leak for the amplitudes, something that has been 
extensively discussed in the literature. The top-right panel 
of Fig. [2] shows vividly that the red noise leak also affects 
the phases in a very distinctive way. The onset of the effect 
is around 10~ 4 Hz (the same as it is for the amplitudes) and 
from then on all the phases follow an arched trend towards 
negative values. The last point in this plot, corresponds to 
the phase estimate at the Nyquist frequency, /jv/2 = /Nyq> 
for which the DFT is a always a real number (positive or 
negative). This means that, for /Nyq, we average only the 
phases corresponding to the negative values, which are al- 
ways 7r, over the total number of points, since for the positive 
values the arg is not formally defined (in this context one 
could consider it equal to 0) . On average one should get val- 
ues around (4>n/2) — tv/2 — 1.57 (i.e. roughly equal numbers 
of positive and negative values). 

We repeat the same process but this time with a con- 
stant underlying PSD i.e. a power-law with slope 0. As we 
can see from the bottom panels of Fig.[2]both the amplitudes 
(bottom-left panel) and the phases (bottom-right panel) are 
not affected by the red noise leak effect. It is clear that, as 
in the case of the Fourier amplitudes, the effect of red noise 
leak for the Fourier phases depends on the shape of the input 
PSD, e.g. the softer the power-law of the underlying PSD, 
the greater the effect of the red noise leak. As we discussed 
previously, our methodology correctly takes this effect into 
account by extending the total length of the surrogate data 
set and then chopping the converged final synthetic light 
curve to the desired length. 



2.4 Basic differences and advantages from 
previous works 

Our method is essentially a marriage of TK95 and SS96 al- 
gorithms. The former remains exactly the same during the 
application of the method but the latter (i.e. the iterative 
amplitude adjusted Fourier transform) contains several key 
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Figure 2. The effect of red noise leak in the Fourier amplitudes and phases. [Top-left panel] The averaged periodogram estimates for 
an input PSD with a power-law shape of a slope -2.5. [Top-right panel] The averaged Fourier phases for an input PSD with a power-law 
shape of a slope -2.5. [Bottom-left panel] The averaged periodogram estimates for an input PSD with a power-law shape of a slope 0. 
[Bottom-right panel] The averaged Fourier phases for an input PSD with a power-law shape of a slope 0. 



differences, from the SS96, which makes it suitable particu- 
larly for the needs of astronomical data sets. 

(i) We use a pseudo-random data set following the es- 
timated distribution, rather than a shuffled version of the 
original observed data set. 

(ii) We replace the Fourier phases of the TK95 products 
rather than those of the original data set. 

(iii) All the spectral distortion effects due to the finite 
length and sampling rate of the observed data set (i.e. red 
noise leak and aliasing) are taken into account. 

The coupling of the two methods allow us to study not only 
observed light curves but also theoretical models that give 
predictions about the PSD and the PDF of a given astro- 
physical object. The TK95 is carrying the spectral informa- 



tion (PSD) and the SS96 is distributing the various measure- 
ments (PDF) accordingly. In this way, we can produce, based 
on a theoretical model, realistic and positively defined non- 
Gaussian synthetic light curves as opposed to only Gaussian 
light curves coming from TK95. 

The problem of generating stochastic sequences of num- 
bers with specified properties is extensively analysed in 
the literature since the ea rly 70's (for a complete refer- 
ence guide see ISowevlll986h . In particular, iLiu fc Munsonl 
( 1982) proposed a white Gaussian noise input to a lin- 
ear digital filter followed by a zero-memory non linearity 
(ZMNL). The ZMNL is chosen so that the desired distri- 
bution is exactly realised and the digital filter is designed 
so that the desired autoc ovariance is closely approximated. 
iHunter fc Kearney! l|l983t) proposed a method for the gener- 
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Figure 3. The XMM-Newton data set of NGC4051. [Left panel] The EPIC pn and MOS combined light curve in the 0.5-10 keV 
energy-band in bins of 100 s (obs ID: 0109141401, revolution: 0263). [Right panel] The corresponding periodogram estimates, P b s (/)i 
(grey points) and the underlying best-fit PSD model, 0*(f; Tbf > Cbf)i (black line). 



ation of random number sequences with an arbitrarily spec- 
ified first-order probability distribution function (PDF) and 
an arbitrarily specified first-order autocorrelation function 
(ACF). The procedure involves a stochastic optimisation al- 
gorithm which minimizes the squared sum between the de- 
sired (output) and the actual (observed) ACF estimates. 

An iterative m ethod was developed by 
lYamazaki fc Shinozukal (|l988l ) which generates Gaus- 
sian distributed samples with a given periodogram which 
are then mapped into non- Gaussian distributed numbers. 
This is achieved by employing the invert expression of 
the target PDF (distribution distortion method) and the 
iteration process aims to correct the altered periodogram 
estimates (as they come out from the mapping process) 
to match the desideratum p eriodogram. Th e correlation 
distortion method was used bv lJohnsonl (1 19941 ). and consists 
of a nonlinear transformation which is applied to construct 
non-Gaussian correlated features from correla ted Gaussian 
random draws. Finally, iGurlev et al.l (Il996l ) presented a 
series of mathematical approaches using Volterra series and 
analytical kernels to achieve bispectral matching. In the 
same work a neural network system identification model is 
employed for simulation also demonstrating the ability to 
match higher order spectral characteristics. 

Besides the abovementioned differences, our method 
differs fundamentally from all the previous methodologies 
with respect to the matching process of the PSD. We are 
not interested in matching the individual periodogram es- 
timates (derived from the observed data set), but instead 
in the underlying PSD. In this way, at a given Fourier fre- 
quency fj, the various periodogram estimates, P(fj), are 
distributed asymptotically around £P(fj) as a gamma dis- 
tribution, Y[u/2,3 s (fj)] (equation IA9|rl with u degrees of 



freedom (d.o.f.) corresponding to v — 1 for the the Nyquist 
frequency and v = 2 for all other frequencies. 

2.5 A publicly available code in the form of an 
active document 

In the spirit of Reproducible Results and Active Documents 
jClaerboutl 1 1990] ) . we provide an interactive mathematica 
notebook (created with the version: 9.0.1.0) which contains 
the complete numerical code together with the example pre- 
sented in Sect. [3] In detail the notebook contains: 

• The XMM-Newton data set of the AGN NGC4051 
which is used in Sect. [3] 

• A version of TK95 code taking into consideration (if 
needed) the spectral distortions described in Sect. 12.31 

• The iteration algorithm (SS96). 

• The addition of the Poisson noise as described in 
Sect. E2] 

• An animation of the simulated products at the various 
iteration steps. 

It can be found on the wet0 as part 

of this paper (Online Material) or at 

http : //www. astro . sot on. ac . uk/~dele08/Art if icialLight Curves/ 

By changing the two random seeds (used for step i and step 

ii), the whole document is automatically updated and a 

new artificial light curve is produced. The simple numerical 

code, provided in the mathematica notebook, can be 

written in a much more compact form (i.e. much more 

computationally efficient), but for clarity purposes we have 

split it up in various programming lines. Due to its simple 

nature, the code can be implemented in any programming 

language. 



5 In the literature this is usually referred to as 'scaled \ 2 distri- 6 You can also request the MATHEMATICA notebook via e-mail to 

bution' with 2 d.o.f. (equation IA8II . D. EmmanoulopoulosOsoton.ac.uk 
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3 APPLICATION: THE CASE OF NGC 4051 

3.1 Step-by-step procedure for a single realisation 

We will now apply the method to a single X-ray data set of 
the type I Seyfert galaxy NGC 4051 (z = 0.02336) obtained 
by the European Photon Imaging camera aboard XMM- 
Newton observatory (obs ID: 0109141401, revolution: 0263). 
The observed 0.5-10 keV light curve of NGC 4051 (sample 
light curve) is shown in the left panel of Fig. [3] consisting of 
N — 1170 data points in bins of 100 s. 

The following process is carried out for illustrative pur- 
poses only and aims to simulate a single artificial light curve 
with the same underlying PSD as the sample light curve, and 
an identical observed PDF (as opposed to the parent PDF). 
Nevertheless, depending on the purpose of the study one can 
select an appropriate PDF depicting either the underlying 
or the observed statistical properties. The method, of course 
can also produce artificial light curves coming directly from 
a theoretical model, which specifies an underlying PDF and 
PSD, without the requirement of having an actual observed 
light curve. Since, the PSD and the PDF of the observed 
data set are the two input parameters of the method, we 
first estimate these. Note that since we will not be perform- 
ing any studies involving Fourier phases, the effect of 'red 
noise leak' has been taken into account during (step i) i.e. by 
producing a TK95 artificial data set 1000 times longer than 
the original NGC 4051 data set. For this particular data set 
(something which is generally true for XMM-Newton data 
sets), the aliasing effect is insignificant since we are dealing 
with averaged consecutive measurements. 

Initially we derive the periodogram of the sample light 
curve, P bs(,f), (Fig. right panel, grey points) and then 
we estimate the underlying PSD by fitting the smoothly 
bending power-law model plus a constant, c, representing 
the Poisson noise level: 



1 + (///bend) Qhi 



gh-«lo 



+ C 



(2) 



in which 7 = {A, /bend) «low> CKhigh}) with components the 
source's PSD model parameters i.e. normalisation, bend fre- 
quency, low and high frequency slopes, respectively. Dur- 
ing the fit we fix t he Qi ow to 1.1 ( as de rived from long- 
term RXTE data; iMcHardv et all 120041 ') and the best- 
fitting model parameters are 7bf = {0.030 ± 0.004 Hz -1 , 

2.3to 2 9 x 10 " 4 Hz > L1 » ^oiaol} and c bf = 9- 2 ±o:s x 10-3 

Hz -1 (Fig. [3] right panel, black line). Note that the de- 
riv ed best-fitting values agree entirely with the ones given 
bv lVaughan et all l|201lf ) but that the best -fit /bend differs 
from t he value 8I3 x 10 -4 Hz estimated bv lMcHardv et all 
(2004) (note that the error estimates correspond to the 90 
per cent confidence intervals). This best-fitting model will be 
the target PSD which should be matched by the surrogate 
data sets. 

In order to assess the probability distribution of the 
sample data set we then form its probability density function 
histogram (Fig. [4] left panel, black line). The latter exhibits 
two clear modes: the first narrow mode corresponds to the 
low count rate regimes (e.g. the regions around 35, 55 and 
110 ks in the left panel of Fig. [3]), and the second, broader 
mode, to the high source states. We parametrise the ob- 
served distribution of the sample data set by fitting a prob- 
ability model. For this particular data set of NGC 4051, we 



select a mixture distribution model consisting of a gamma 
distribution, T(n,8) with n and 9 being the shape and the 
scale parameters, and a log-normal distribution, In Af(fi, a 2 ), 
with fj, and a 2 being the mean and the variance of the count 
rate's (variable x) natural logarithm (equation [3]). Finally, 
each of these component distributions contribute to the over- 
all PDF of the mixture distribution, f m i x (:r; ff), with a weight 
of wr and w\ n ss = 1-wr respectively, with an ff being a vec- 
tor consisting of the model parameters ff = {k, 9, /i, a, wr}' 



fmix(x; jy) = 



6- K e—/"a 

r(«) 



+ WlnjV" 



2-irxa 



(3) 



The best-fitting PDF model is shown superimposed on 
the data sample histogram in the left panel of Fig. [4] 
(grey line) with best-fitting model parameters of f%{ = 
{5.67ig;gS, 5.96±g;8J, 2.14 ± 0.06, 0.3ll°;°!, 0.82t°;°|}. Hav- 
ing as null hypothesis, Ho, that the data set is drawn from 
the derived best-fit model distribution and alternative hy- 
pothesis, Ha,, that it was not drawn from that distribu - 
tion, the Anderson-Darling test (jAnderson fc Darlindll952h 
yields a statistic value of 0.34 corresponding to an Ho prob- 
ability of 0.89 which depicts the good representation of the 
data by the given model. 

Having defined the best-fitting PSD and PDF we con- 
tinue to the actual production of the artificial light curves: 

(i) The best-fitting PSD model with c = is then used 
to create the normally distributed time series with a peri- 
odogram, Pnorm(/j). A simulated light curve of this kind to- 
gether with its periodogram is shown in Fig. [S] (left and right 
panel, respectively). By estimating the DFT of £norm(t), for 
each Fourier frequency, fj , we derive the corresponding am- 
plitudes and phases, ^norm(i) and <^norm(j) respectively. 

(ii) The best-fitting PDF model is used to generate a list 
of N pseudo-random variates, :Esim,i(t), shown in the right 
panel of Fig. [4] Then, by estimating for each Fourier fre- 
quency the DFT of x s im,i(£), DFT a im,i(j), we derive the cor- 
responding amplitude and phases, £/sim,i(j) and s im,i(j)> 
respectively. 

(iii) Spectral adjustments: #/ a im,i(j) are replaced by 
^4orm(i), keeping the 0sinx,iO) unaltered, yielding an ad- 
justed version of DFT aim ^(j), DfT s i m . ad ju S t,i(j)- By per- 
forming an IDFT of we obtain the light curve a; s i m . a dj ua t,i(t) 
(Fig. |6j left panel) with an identical periodogram to 
fnorm(/) (Fig. |6] right panel, grey points), but now 
with measurements which are not longer distributed as 
fmix(x;rfbf) (Fig. El left panel, inset). 

(iv) Amplitude adjustments: Finally, the values of 
Xsim. adjust, i(t) are replaced by the values of x s im,i(i), based 
on the ranking of the former. The resulting light curve, 
£ s im,2(i), (Fig. [3 left panel) has an identical histogram with 
the sample light curve, but this time the periodogram es- 
timates do not correspond to the target underlying PSD, 
^(/;7bf, 0). The right panel of Fig. [7] shows the binned 
logarithmic periodogram estimates (in bins of 15 consec- 
utive periodogram estimates) of the amplitude adjusted 
light curve together with !P(f; Jbt, 0) and the ratio plot i.e. 
data/model. From the ratio plot it is obvious that partic- 
ularly the high frequency periodogram estimates, in par- 
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Count rate (counts-s ) t (ks) 



Figure 4. Step ii. [Left panel] The PDF histogram of the observed data (black line) together with the best fit mixture distribution 
model, f m i x (a;; rf b {) (gray line, equation [3j . [Right panel] An ensemble of N pseudo-random variates produced from equation J3j and the 
inlay shows its corresponding PDF histogram. 




t (ks) v (Hz) 

Figure 5. Step i. [Left panel] A normally distributed simulated light curve created using the original data NGC4051's best-fitting PSD 
(c = 0) and its PDF histogram (inset). [Right panel] The corresponding periodogram estimates (grey points) and the underlying target 
PSD model, #(/;7 bf ,0) (black line). 



ticular above 10 3 Hz, are systematically larger than the 
corresponding £^(/;7bf,0) values by a factor of 1.5. 



All the abovementioned procedure is a single iteration step 
of the method. Exactly the same process is repeated itera- 
tively from step iii, by replacing the iCsini,i(i) with the am- 
plitude adjusted light curve x s im,2(t), the a; s im,2(i) with the 
£ s im,3(i) and so on. For this particular case, after the 55 th 
iteration the synthetic light curves remain the same. 



3.1.1 Convergence of a single artificial light curve 

In order to check the convergence of the method, we fit the 
bending power-law model (equation [2| to the correspond- 
ing periodogram estimates of each iteration step. The re- 
sulting values for the ai ow , Qhigh and /bend are shown in 
Fig. which we can see form a plateau after the 55 th 



7 The results of the first iteration are excluded from the pan- 
els in order to cover better the variations of the other iterated 

"high = 3.21 and 



2.56, 



products. The omitted values are ct\ a 
l°Sio[/bcnd (Hz)] = —1.90), respectively. These large deviations 
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t (ks) v(Hz) 



Figure 6. Step iii. [Left panel] The spectrally adjusted light curve together with its PDF histogram (inset). [Right panel] The correspond- 
ing periodogram estimates (grey points), are by construction identical to those shown in the right panel of Fig. [5] and the underlying 
target PSD, 0* (f ; % { , 0) (black line). 
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Figure 7. Step iv. [Left panel] The amplitude adjusted light curve together with its PDF histogram (inset). [Right panel] The corre- 
sponding binned logarithmic periodogram estimates (open circles) and the underlying target PSD model, 3?(f\ 7bf , 0) (black line) having 
attached in the bottom the ratio plot, data/model. 



iteration. Fig. [9] shows the corresponding results for the 
56 th iteration; the synthetic light curve (left panel) follows 
the exact distribution of the observed data and the binned 

result from the fact that the minimisation routine does not lo- 
calise the minimum (after 500 iteration steps) for the initial pe- 
riodogram estimates, P s i m i(/), corresponding to a flat PSD (see 
footnote [4j . Naturally, by increasing the number of iterations we 
can correct for this artefact but in this context it is unnecessary 
since for the next steps the localisation of the minimum occurs 
in less than 30 iterations since the degeneracy «high = Q low — 
does not exist any more. 



logarithmic periodogram estimates (in bins of 15 consecu- 
tive periodogram estimates) as expected follow, within the 
68.3 percent confidence levels depicted by the error bars, 
the corresponding £P(f;'yM,0). This means that the 56 th 
synthetic light curve is the final simulated data set, with 
all the desired statistical and variability properties of the 
original data set. The best fitting PSD model for the 56 th 
surrogate yields: oa ow = 1.36_ ; 3g , cthigh = 2.24_ ; 05 and 
/bend = 3.41^3 x 10 -4 Hz and the resulting histogram is by 
construction identical to the original one since it is drawn 
from its best fitting PDF model (Fig. [4] Right panel) . 
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Figure 8. Convergence of the iteration process (the first itera- 
tion is not shown -see footnote [7}). The horizontal solid grey lines 
indicate the corresponding target values i.e. the best fitting val- 
ues as derived from the observed data. The dotted line among 
the various iterations is a linear interpolation intended only to 
guide the eye. [Top panel] The oq ow estimates as a function of 
the iteration number (in logarithmic scale), stabilising at 1.36. 
[Middle panel] The «high estimates as a function of the iteration 
number (in logarithmic scale), stabilising at 2.24. [Bottom panel] 
The logarithm of /bend estimates as a function of the iteration 
number (in logarithmic scale), stabilising at -3.47. 



Finally, in order to take into account the Poisson statis- 
tics (Sect (221) we re-sample the 56 th surrogate data set ac- 
cording to equation [1] The resulting artificial light curve is 
shown in Fig. 1101 This single random synthetic data set en- 
closes all the information of our initial data set and thus can 
be used in any sort of statistical study. 



3.2 Overall procedure for an ensemble of 
realisations 

3.2.1 Proposed methodology 

In this section we repeat the abovementioned procedure for 
an ensemble of 1000 realisations and we compare the sta- 
tistical properties of the final products to those of the orig- 
inal data set of NGC 4051 i.e. the light curve and under- 
lying PSD (Fig. [3J). Initially, we perf orm a goodness-o f-fit 
Kolmogorov-Smirnov hypothesis test (|Press et al.lll992l ) for 
the distribution of each artificial light curve, with Ho that 



the surrogate data set is drawn from the best fit model dis- 
tribution of NGC 4051 (Fig. [4] left panel) and i7 a that it was 
not drawn from that distribution. The mean Kolmogorov- 
Smirnov statistic derived from the ensemble of light curves 
is D n = 0.0251o'qo6 an d the mean Ho probability derived is 
0.511q 22' depicting the high degree of accordance between 
the distribution of the artificial data sets and that of the 
original data set (the error estimates correspond to the 68 
per cent confidence intervals). Thus, this method assures 
that the resulting simulated data sets have the same statis- 
tical moments as the observed light curve of NGC 4051. 

We then fit the PSD model of equation [5] to the peri- 
odogram estimates of each artificial light curve. The distri- 
butions of both the low and the high frequency PSD slopes, 
as well as the the bending frequency, are shown in the left 
and right panels of Fig. 1111 respectively. The sample mean 
values, together with their 68.3 per cent confidence limits 
(i.e. standard deviation of the sample mean) and the 68.3 
per cent confidence intervals of the distributions for aa ow , 
Qhigh and /bond, are given in Table [T] The simulation re- 
sults which come from the proposed method, are entirely 
consistent with those derived from the original data set of 
NGC 4051, indicating that there are no biases towards the 
PSD model parameters which could cause systematic de- 
viations from the targeted values. Thus, the artificial light 
curves produced as an ensemble with this algorithm have the 
same variability power, as a function of Fourier frequency, 
as that of NGC 4051. 



3.2.2 Convergence of the ensemble of artificial light curves 

The scattering in the various estimated PSD model param- 
eters, coming from the 1000 simulated light curves, origi- 
nates from the asymptotic distribution of the various pe- 
riodogram estimates, P(fj), around the input input PSD, 
^(/;7bf,0). As we discussed in Sect. 12.41 (see for details 
Appendix IA2|) at given Fourier frequency fj, P(fj) is dis- 
tributed asymptotically around ^(/; jbf , 0) as a gamma dis- 
tribution, F [v/2, &{f; 7bf , 0)] with v d.o.f. This behaviour 
is depicted in the left panel of Fig. [T^] which shows the dis- 
tribution of the 1000 periodograms around the target PSD. 
As a sanity check for our simulations, we test whether for a 
given Fourier frequency, fj , the distribution of their various 
periodogram estimates is indeed in accordance with equa- 
tion El Thus, 



derive for each fj the distribution of 
points and we perform an Anderson-Darling test goodness- 
of-fit test with Hq: the periodogram estimates at a given fj 
are drawn from the F [v/2, ^(f; 7bf, 0)] distribution and Z/ a 
that they are not drawn from this distribution. The mean 
value of the statistic is 2.92^q '±3 yielding a mean Ho proba- 
bility 0.181q'q6 depicting the high degree of accordance be- 
tween the estimated distribution of the simulated products 
and the expected ones (the error estimates correspond to 
the 68 per cent confidence intervals). 

Finally, depending on the type of the statistical study, 
it is not necessary always for each surrogate data set to 
carry out the iteration process up to the convergence point 
(as shown in Fig. [8}. Stopping the process in an intermedi- 
ate step e.g. at the 5 th iteration step, will yield surrogate 
data sets which will still have accurate PSD parameters (i.e. 
they will be distributed correctly around the target values 
without systematic trends), but the various estimates will 
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56 Ih iteration: Amplitude adjusted light curve (step iv) - 




t (ks) v(Hz) 

Figure 9. Step iv for the 56 th iteration. [Left panel] The amplitude adjusted light curve together with its PDF histogram (inset). [Right 
panel] The corresponding binned logarithmic periodogram estimates (open circles) and the underlying target PSD model, ^(Z; 7bf , 0) 
(black line) having attached in the bottom the corresponding ratio plot, data/model. 
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Figure 10. The 56 th surrogate data set re-sampled from a 
son distribution as dictated by equation [T] (At = 100 s). 



Pois- 



be less precise than those derived from the final converged 
products (i.e. they will exhibit larger scatter around the tar- 
get values). Nevertheless, the differences are very small and 
for this particular example (5 th iteration step) are on aver- 
age of the order of 5 per cent. In the right panel of Fig. [12] 
we show this effect by plotting the convergence in the PSD 
parameter a n i g h for 15 synthetic data sets (as we did in the 
top panel of FigJHJ) . The publicly available mathematica 
notebook (Sect. 12. 5p contains an animation showing these 
small differences between all the iteration steps for a single 
surrogate. 



3.2.3 Exponential light curves 

In this section we follow the recipe of lUttlev et all (|2005t) 
and exponentiate (in base e) the TK95 products which are 
produced by the renormalised P SD model (using equations 
13 and 14 in lUttlev et al1l2005D . In this case, the artificial 
data sets always follow by construction a log-normal dis- 
tribution which differs intrinsically from the observed sta- 
tistical properties (described by equation |3J) which we are 
interested in reproducing for the given illustrative purposes. 
Thus we do not need to perform a goodness-of-fit hypothesis 
test for the distributions, since we know a priori that they 
are by construction different. 

In the next step, we repeat the PSD model fitting proce- 
dure for the periodogram estimates of the exponential light 
curves. The results for the distribution of the best fitting 
parameters of ai OW) «hi g h and /bend are shown in Fig. 1131 Fi- 
nally, as above, the sample mean values together with their 
68.3 per cent confidence limits (i.e. the standard deviation 
of the sample mean) and the 68.3 per cent confidence inter- 
vals of the distributions for Qi ow , ahigh and /bend are given 
in Table Q] We can see that the PSD becomes systematically 
softer by ar ound 8 per cent som ething which is also shown 
in fig. Bl in lUttlev et~ai] l|2005t ). Most importantly for this 
particular case the most noticeable distortion appears in the 
bend frequency which systematically shifts towards higher 
frequencies, deviating in this way by 70 per cent from the 
target value. 

Using these simulated data sets for the recovery of the 
bend frequency o f an irregularly sam pled light curve (using 
the procedure of lUttlev et al.l 120021 ) will yield systematic 
deviations from the true underlying value. Note that the 
degree of the various PSD distortions of the exponential light 
curves depend on the particular variability properties of the 
light curves, as well as the actual values of the underlying 
PSD model. 

A potential solution to these spectral alterations could 
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fflow , Ohigh log 10 [/bend (Hz)] 

Figure 11. Overall simulation results for 1000 artificial light curves. [Left panel] The histogram of the best fitting ai ow (grey lines) 
and ctMgh (black lines). The solid lines corresponds to the target values coming directly from the observed data 1.1 and 2.2 respectively, 
the dashed line corresponds to the mean estimate of the distribution, 1.12 and 2.21 respectively, and the dotted line corresponds to 
best fitting value as derived from the 56 th surrogate of the single realisation, 1.36 and 2.24 respectively (Fig. [8] top and middle panel, 
respectively). [Right panel] The histogram of the best fitting logarithms of /bend- The solid line corresponds to the target value coming 
directly from the observed data /bend = 2.3 X 10 -4 Hz (or -3.64 in log scale with base 10), the dashed line corresponds to the mean 
estimate of the distribution, 2.4 X 10 — 4 Hz (or -3.62 in log scale with base 10), and the dotted line corresponds to best fitting value as 
derived from the 56 th surrogate of the single realisation, 3.4 X 10 — 4 Hz (or -3.47 in log scale with base 10, Fig. [8] bottom panel). 



Table 1. Global simulation results for the PSD parameters. 



Model-parameter Target values t Proposed method* Exponential function method* 

«iow 1.1 (fixed) 1.123±g; 00 f, [0.87, 1.20] 0.973±g;gg|, [0.86,1.03] 

a high 2.20± ; 4 2. 213jl°;oS5i, [2.15,2.26] 2.063 ± 0.002, [2.00, 2.10] 

/bend (xlO" 4 Hz) 2.3±ojj 2.4 ±0.1, [2.1,3.3] 3.9 ± 0.1, [3.2, 5.0] 



t These are the values of NGC 4051 derived in Section ^. II 

* The first value is the sample mean together with its 68.3 per cent confidence limits and the second value in the square brackets 
corresponds to the 68.3 per cent confidence intervals of the distribution around the mean. 



be the following: To consider the logarithm of the observed 
data set (which is Gaussian distributed for the case of a 
log-normal distribution) and estimate its PSD which is then 
going to be used as the input PSD for the TK95 simula- 
tion. The exponentially transformed TK95 products should 
follow the original PSD of the observed data set which is 
log-normally distributed. Before following this recipe, fur- 
ther investigation of this approach should be carried out, 
something which is out of the scope of this manuscript. 



4 COMPLIMENTARY APPLICATIONS: 
FERMI AND RXTE DATA SETS 

In order to show the wide applicability of our newly pro- 
posed method, we further apply it on two radically different 
looking data sets: a 7-ray Fermi-LAT data set for the blazar 
3C 454.3, an X-ray RXTE data set for the XRB CygX-1. 
The 7-ray blazar 3C 454.3 

We use the weekly Fermi-LAT light curve of 3C 454.3 con- 
sisting of 236 points between 54684 and 56334 MJD in the 



0.1-300 GeV energy rangCl The light curve is shown in the 
left panel of Fig. [14] with the black points corresponding to 
the actual flux measurements and the grey points (around 
20 Ms and 90-130 Ms) to the 90 per cent confidence up- 
per limits. For the purposes of this study we have simply 
used the upper limits as actual flux measurements but more 
precise treatment using survival analysis techniques will be 
presented in a future work. 

To remind the readers, the two basic components for the 
method are the distribution of the data as well as the cor- 
responding PSD. The PDF histogram of the data is shown 
in the left panel of Fig. [Tl] (left inset) and as we can see 
it is characterised by a long right-tail which becomes zero 
at much higher flux values from those expected by a simple 
exponential distribution. Note that if we were about to fit 
an exponential distribution PDF model to this data set, it 
would yield a best-fitting inverse scale of 4.25 ± 0.06 hav- 

8 The Fermi-LAT data has been retrieved from: 
http : / /f ermi .gsf c .nasa. gov/ ssc/data/access/lat/msl_lc/ 
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Figure 12. Distribution and convergence of the ensemble periodogram estimates. [Left panel] The distribution of the 1000 periodograms 
(grey points), originating from the 1000 synthetic light curves, around the underlying PSD, 7bf , c^f), (black, solid line). Note that 

the synthetic light curves contain Poisson noise. [Right panel] Convergence of the PSD parameter c%j g h as estimated from fitting the 
periodogram estimates of the 15 synthetic data sets for all the iteration steps. 



200 




Figure 13. Overall results for 1000 exponential light curves. [Left panel] The histogram of the best fitting c*i ow (grey lines) and cthigh 
(black lines). The solid lines corresponds to the target values coming directly from the observed data 1.1 and 2.2 respectively, the dashed 
line corresponds to the mean estimate of the distribution, 0.97 and 2.06 respectively. [Right panel] The histogram of the best fitting 
logarithms of /bend- The solid line corresponds to the target value coming directly from the observed data /bend = 2.3 X 10~ 4 Hz (or 
-3.64 in log scale with base 10), the dashed line corresponds to the mean estimate of the distribution, 3.9 X 10~ 4 Hz (or -3.41 in log scale 
with base 10). 



ing a very poor fit quality, with the Anderson- Darling test 
statistic value of 30.21 and an Ho probability (i.e. the data 
set is drawn from a population with the fitted distribution) 
of 0. The right inset in the same plot shows the periodogram 
estimates of the light curve together with the best fit bend- 
ing power-law model (eq. [2}, a\ ow = a h i g h = 1.3 j^* and 
c = 2.4 ± 0.3 Hz" 1 (/tend = 7.8l" x 10" 8 Hz), imply- 



ing that a simple power-law model is enough to describe 
the data. Using these two data components, we apply our 
method and we produce an artificial light curve (convergence 
occurs after 22 iterations) which conserves all the statistical 
and variability properties of the original data set (Fig. 1141 
right panel). Thus, ensembles of such artificial light curves 
can be used in any sort of statistical analysis that requires 
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establishment of confidence intervals e.g. in CCF analysis 
during multiwavelength campaigns. 
The XRB CygX-f 

We use the daily-averaged RXTE-ASM light curve of CygX- 
1 consisting of 5000 points between 50135.4 and 55422.6 
MJD in the 2-10 keV energy rang^fl- The light curve is shown 
in the left panel of Fig. [TS] and contains a small number 
of gaps (289 in total), which we have been filled up with 
linearly interpolated values. Appropriate treatment, using 
bootstrapping, should be performed in order to check the 
effects of the gaps during the following PSD estimation, but 
for the purposes of this study we ignore this stetF°l. 

The PDF histogram of the data is shown in the left 
panel of Fig. [15] (left inset) having a characteristic bimodal 
shape, depicting the high and the low flux states of the 
source. Assuming that no artefacts are induced to the peri- 
odogram estimates, due to the interpolation, the best fit 
PSD model yields qi ow = 0.49tg;2?, "high = l-58±g:i|, 



1.32±g;| 3 x 10" 



Hz and c = 3692±^ Hz" 1 (Fig. \T5\ 



left panel, right inset). After applying our method (con- 
vergence occurs after 267 iterations), the resulting artificial 
light curve fFig. 1151 right panel) resembles remarkably to the 
original data set of Cyg X-l which was chosen as an example 
of extreme 'bursticity'. 



5 APPLICATION TO CCF ANALYSIS 

CCF analysis is one the most common method used for 
analysing multiwavelength light curves obtained in a si- 
multaneous fashion. There are several different flavours 
and implementa tions of the CCF e.g. discrete correlation 
function (DCF) ilEdelson fc KroUldl 19881 ) , interpolated C CF 
l|Gaskell fc Sparkelll986l) . modified CCF dLi et all 120041), z - 
transform discrete correlation function I Alexander! I1997T ) 
that are used within the astronomical community. Particu- 
larly for the case of irregularly sampled light curves, estima- 
tion of the confidence levels, in both the CCF values and/or 
time delays, is usually done by performing Monte Carlo sim- 
ulations. Application of a given CCF method to an ensemble 
of paired random artificial light curves, which have the same 
PSD as the observed data sets, yields the probability of get- 
ting a given CCF estimate purely by chance coincidence. At 
present the simulated light curves are produced using the 
TK95 formalism. 

Since the TK95 synthetic data sets are distributed nor- 
mally by construction, the method is appropriate to yield 
CCF confidence levels only for Gaussian light curves. In or- 
der to show that deviations in the CCF levels can occur for 
the case of non-Gaussian light curves, for illustrative pur- 
poses, we create two 'bursty' light curves in the following 
way: Using the TK95 procedure, we produce two artificial 



been retrieved from: 



9 The RXTE-ASM data has 
http : //xte .mit . edu/ASM_lc . html] 

iu The frequency domain bootstrap methodologies are used to es- 
timate the distribution of the re-sampled periodogram estimates. 
The main difficulty is to select appropriate statistical estima- 
tors whose variance fits that of the re-sampled periodogram esti- 
mates (at a given frequency). Useful ana lyses on this topic have 
been performed by several authors (e.g. iFranke fe Hardlel Il992l ; 
iDahlhaus fe Janasj|l99^ : I Kreiss fe Paparoditisll2003f) . 



light curves having different bending PSDs, XG{t) and J/g(£), 
200 ks long with a bin size At = 100 s, using the same ran- 
dom seed (in order to be correlated) . During the production 
of ya(t), we multiply the imaginary part with a random 
number between [—0.15,0). This will slightly modify the fi- 
nal flare profiles i.e. amplitudes and phases (equations IA2I 
and IA3|) yielding an asymmetric CCF profile around the 
zero time delay. Finally, the resulting normally distributed 
numbers are used as exponents for the bases 2 and 1.5 re- 
spectively in order to produce two 'bursty-like' light curves, 
x(t) and y(t), (Fig. 1171 left panel) having of course different 
underlying PSD parameters from the initial ones. The initial 
PSD parameters, used for the TK95 methodology, together 
with the final PSD parameters, estimated after fitting the 
periodogram estimates of the exponentiated light curves (in 
base 2 and 1.5, respectively), are given in Table[2] Note that 
in both light curves we have added Poisson noise following 
the recipe described in Sect. 12.21 For the purposes of this 
study these two 'bursty' light curves will be used as two si- 
multaneously obtained observations of the same object, in 
different energy bands, for which we will perform CCF anal- 
ysis. 

Initially, we estimate the DCF for the two 'bursty' light 
curves (Fig. 1171 right panel, black points). Then, in order 
to assess the confidence level of the correlation we produce 
two ensembles of 1000 pairs of artificial light curves; one 
following the classical procedure of TK95 and another one 
using the proposed methodology described in Sect .[2] Then, 
for each method we estimate the DCFs between all the pairs 
and for each time delay, r, we estimate the 0.025 and 0.975 
quantiles corresponding to the upper and lower limits of the 
90 per cent confidence bands. 

As we can see from the right panel of Fig. [17] (TK95: 
grey lines, new method: black lines) realistic representation 
of the non-Gaussian light curves yields in generally an in- 
crease in the confidence level range of the order of 25 per 
cent which reduces the detection significance of the DCF 
peak. The reason is that Gaussian light curves have on av- 
erage the same number of flares above (positive direction) 
and below (negative direction) the mean, in contrast to the 
'bursty' light curves which exhibit flares only in the positive 
direction. That means that between two 'bursty' light curves 
it is much more likely to get a fake correlation by chance co- 
incidence since there is only one possible flare direction. 

This can be very well understood with the following 
toy-simulation. We produce a series of 2000 pairs of time 
series each one consisting of 100 positive and negative trian- 
gular positive pulses (the simplified analogue of a Gaussian 
light curve) occurring at uniformly random non-repetitive 
integer numbers and we measure the number of simultane- 
ous pulse occurrences by chance coincidence between all the 
pairs. Then, we repeat the simulation but now the time se- 
ries consist only of positive triangular unit pulses (the sim- 
plified analogue of a 'bursty' light curves). As we can see 
from Fig.[T6]the mean occurrence of chance coincidence cor- 
related events is almost doubled for the case of the posi- 
tive triangular pulses due to the flare directionality prop- 
erty. This shows that for non-Gaussian light curves TK95 
underestimates the chance coincidence occurrences of corre- 
lated events and thus yields erroneous smaller estimates for 
the confidence intervals i.e yielding an overestimation of the 
CCF's peak significance. 
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Fermi-LAT: 3C 454.3 Artificial light curve 
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Figure 14. The Fermi-LAT data set of the blazar 3C 454.3. [Left panel] The weekly averaged 7-ray light curve in the energy range of 
0.1—300 GeV with black points corresponding to actual flux measurements and grey points to the 90 per cent confidence upper limits. 
The left and right insets show the PDF histogram and the periodogram estimates (grey points) together with the best fit power-law 
model (black line), respectively. [Right panel] A single artificial light curve coming after 22 iterations (convergence). 




Figure 15. The RXTE-ASM data set of the XRB CygX-1. [Left panel] The daily averaged X-ray light curve in the energy range of 2-10 
keV. The left and right insets show the PDF histogram and the periodogram estimates (grey points) together with the best fit bending 
power-law model (black line), respectively. [Right panel] A single artificial light curve coming after 267 iterations (convergence). 



6 THE RMS-FLUX RELATION 

In the special case of a parent log-normal distribution, the 
rms-flux relation (see Sect. [1} may be sometimes of vital 
importance for the needs of a statistical study or a theo- 
retical model. The surrogate data sets, following a parent 
log-normal distribution, have embedded this property in a 
natural way without the need for further adjustments or 
tuning. 

To show that our method automatically produces the 



rms-flux relationship, we first create a sample light curve 
which inherently has the rms-flux relation by following a 
parent log-normal distribution. Using the TK95 procedure 
(with initial PSD parameters: ai ow = 1.5, Qhigh = 2.8 and 
/bend = 1 x 10~ 3 Hz) we produce a synthetic data set, being 
200 ks long in bins of 100 s, and then we exponentiate the 
resultant data set (having final PSD parameters: cvi ow — 

l.ll±o:Si» «high = 2.481^6 and /bend = (2.2 ± 0.6) x 10" 4 
Hz (Fig. 1181 lefy panel). This light curve will be used as the 
observed light curve that we want to simulate. 
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Figure 16. Correlated events for the toy-simulation. The black 
line corresponds to the distribution of the correlated events for 
the case of positive-negative triangular unit pulses. The grey line 
corresponds to the same distribution for the case of positive tri- 
angular unit pulses. 



Table 2. PSD model-parameters for the CCF simulations. 



Model-parameter 


Initial PSD 

xa(t), y G (t) 


Final PSD+ 
x(t),y(t) 


"low 


0.9,1-8 


uyi -0.08' u - MD -0.07 


"high 


2.3,2.8 


2.581^5 ,2.67 ± 0.06 


/bend (X10- 4 Hz) 


2.6,10 


14±|,18±3 



t These are the values used in the simulations. 

We produce 1000 artificial light curves using our pro- 
posed method and then for each one of them we estimate 
the rm s-flux relation using the prescription of lUttlev et al.l 
(2005). We select three different length segments of 0.5, 1.5 
and 5 ks consisting of 5,15 and 50 bins, respectively. Under a 
given binning scheme, for each flux value we estimate an av- 
erage rms and its standard deviation coming from the 1000 
surrogate data sets. The results are shown in the right panel 
of Fig.[TS]and as we can readily see the simulated light curves 
follow remarkably well the linear rms-flux relation for a vari- 
ety of time-scales below and above the /bend, corresponding 
approximately to 4.55 ks. This widely observed variability 
property is embedded in our artificial light curves in a nat- 
ural way depicting in a vivid way the fact that our artificial 
light curves are exact replicas of the observed light curves. 

This flare-directionality, which is actually mapped on 
the histogram of the 'bursty' light curves in the form of their 
positive skewness, is taken automatically into account dur- 
ing this newly proposed light curve simulation method. This 
means that for the establishment of confidence intervals it is 
of great importance to take correctly into consideration the 
distribution of the measurements since this can affect sig- 
nificantly the level of chance coincidence occurrences. Note, 
that for the case of Gaussian light curves (i.e. with minus- 



cule skewness) the method automatically is in accordance 
with the confidence intervals derived by TK95. 



7 INVARIANT QUANTITIES AND 
STATISTICAL DEPENDENCIES 

During any simulation process, it is very import understand 
which light curve properties are preserved and which are not. 
The TK95 procedure preserves only the underlying PSD of 
the observed data set, assuming a Gaussian distribution of 
measurements for all the cases. Our method preserves both 
the underlying PSD and the PDF (observed or parent). 

The preservation of the PDF means that all the sta- 
tistical moments of a given data set i.e. mean (/i), variance 
(<r 2 ), skewness (71), kurtosis (72), and so on, are identical 
between the observed and the surrogate data sets. Since, all 
these quantities are included in the input PDF (they de- 
scribe its shape e.g. asymmetry, peakedness), they are con- 
served by construction; during the last iteration step (rank- 
ing, step iv) all the measurements are redistributed based 
on the input PDF (the insets in the right panel of Fig U and 
the left panel of Fig. [9] are identical). The fact that all the 
statistical moments are conserved does not mean that all 
the statistical dependences of the measurements (e.g. non- 
linear interactions) are preserved. These are two completely 
different statistical quantities. 

The various statistical dependences of the measure- 
ments are characterised only by the Fourier transform 
of the joint cumulant functions, known as polyspectra, 
e.g. autospectrum, bispectrum, trispectum and so on (Ap- 
pendix|B]). Our method preserves only the second order joint 
cumulant (i.e. the covariance) of the input data set, and thus 
its Fourier transform, autospectrum, (and thus its squared 
amplitude, the PSD) is the only spectral quantity which is 
preserved in the final converged synthetic light curve. Thus, 
the only dependencies that are preserved are those corre- 
sponding to the covariance - all the higher-order dependen- 
cies are ignored. 

A very common source of confusion and misunderstand- 
ings is that there is a notion that preservation of higher order 
statistical moments (e.g. 71, 72 and so on) means preserva- 
tion of the higher order spectra. The statistical moments 
characterise only the shape of the PDF, whilst existence of 
potential dependencies between the data points are mapped 
only on the polyspectra. As we can see in Appendix iBl this 
confusion originates from the fact that a 2 and 71 (depicting 
the shape of the PDF) and are equal to the zero delayed joint 
cumulants (6*2(0), Cs(0, 0)) (i.e. for si = S2 = 0), however 
polyspectra are the Fourier transform of the joint cumulants 
i.e. summed over all Sj. 

A simple example to demonstrate this is the follow- 
ing. We con sider a genuine non-line ar process sim ilar to the 
one g iven in lProvenzale et alj l|l992t ) (also used bv lVio et al.l 
11992 



x\t) 



x(t) 1/5 + x(t) 3 wn{t) 



(4) 



in which wn(t) is a Gaussian white noise process with mean 
value of and standard deviation of 1. We solve this equa- 
tion for £(0.01) = 1 in t G [0.01, 50] with a At = 0.01 and 
then we scale the time axes from 1 to 5000 time units (t.u.) 
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Figure 17. Establishment of the statistical significance on the DCF estimates. [Left panel] The two artificially produced 'bursty' light 
curves, x(t), y(t) having the same random seed. [Right panel] The black points correspond to the DCF estimates of of x(t) and y(t), 
separated by 0.1 ks (100 s). The grey and black horizontal lines corresponds to the 90 per cent confidence bands as derived from the 
synthetic data sets of TK95 and the newly proposed method, respectively. 
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Figure 18. The rms-flux relation. [Left panel] The exponentiated light curve together with its binnned logarithmic periodog 
final PSD model (inset). [Right panel] The average rms-flux estimates coming from the 1000 simulated light curves using three 
binning schemes i.e. 0.5, 1.5 and 5 ks, respectively. 



ram and 
different 



in steps of 1 t.u. One realisation of the corresponding pro- 
cess, x n \(t), is shown in the left panel of Fig. 1191 Then, we 
shuffle randomly the data of this process (having an equal 
probability among all the numbers) yielding a white noise 
process, x s n(t), and we plot the data in the right of Fig.[T9l 
In this way none of the initial dependencies between the 
data points are preserved, but the two data sets still have 
identical PDFs, since they consist of exactly the same data 
points (Fig. I20|) . This PDF has non-zero higher order sta- 
tistical moments i.e. skewness and kurtosis of 71 = 2.57 and 
72 = 12.77, respectively. 

For each data set, we then estimate the normalised 
square amplitudes of its bispectrum, known as bicoherence, 



following iKim fc Powers! (Il979l ). for the f requencies inside 
the in ner triangle of the principal domain ()Hinich fc Messeil 
1995). We have divided each data set into 100 segments, each 
one consisting of 50 t.u. (i.e. 50 consecutive data points), and 
for the estimation of the bicoherence, he have averaged the 
corresponding Fourier transforms and biperiodograms. 

As we can see from Fig. 1211 the two data sets have gen- 
uinely different bicoherences i.e. genuinely different bispec- 
tra. The left panel of Fig. I21l exhibits a great deal of structure 
for various combinations of (/i,/2) depicting the nonlinear 
dependencies for the data set x n \(t). On the contrary, the 
right panel of Fig.[2T]shows (as expected) a rather quiescent 
behaviour for the shuffled data set, x s a(t), despite the fact 
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Figure 19. A time series process. [Left panel] A realisation of the nonlinear process (equation [4| , x n \(t), with the times rescaled in the 
range t £ [1,5000] t.u. and At = 1 t.u. [Right panel] A random shuffle of the x n \(t), x s a(t). 
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Figure 20. The PDF of the time series process shown in Fig. 1191 
which has a skewness and a kurtosis of 71 = 2.57 and 72 = 12.77, 
respectively. 



that it shares exactly the same PDF with the previous data 
set. 

This simple example shows vividly that despite the fact 
that the two data sets share exactly the same PDF, i.e. have 
the same high order statistical moments, they do not share 
the same bispectra i.e. the various dependences between the 
measurements are genuinely different. 



8 SUMMARY AND DISCUSSION 

We have presented a new algorithm able to produce artifi- 
cial light curves which are distributed based on a given PDF 
(parent or observed) and a given underlying PSD. Our pub- 
licly available algorithm combines and enhances the meth- 
ods of TK95 and SS96. The new method improves signifi- 



cantly on the widely used procedure of TK95 which is able 
to produce artificial light curves which are only normally 
distributed. Thus, for any sort timing studies, in which sim- 
ulated data sets are needed, our algorithm preserves all the 
genuine variability and statistical source properties yielding 
ensembles of truly random artificial data sets. 

The merits of our method can be summarised in the 
following lines: 

• It reproduces the exact variability properties of the ob- 
served data, since the synthetic light curves follow the input 
PSD. The input PSD can originate either from an actual 
observation or a theoretical model. 

• It reproduces the exact statistical properties of the ob- 
served data/theoretical model since it uses their/its PDF. 
Thus, the surrogate light curves carry all the statistical mo- 
ments and depending on the nature of the statistical study, 
the PDF corresponds either to the observed or the parent 
PDF. 

• Introduction of higher statistical moments (other than 
mean value and variance that characterise completely only 
the Normal distribution) as well as definition of genuinely 
positively probability distributions allow the construction 
of realistic 'bursty' light curves which can not be created by 
TK95. 

• For the special case of Gaussian light curves the method 
yields synthetic data sets are by construction equivalent to 
those of TK95. 

• For the special case of a parent log-normal distribution 
the simulated light curves exhibit the rms-flux relation. 

Particularly for the case of 'bursty' light curves, having by 
definition non-Gaussian positively defined PDFs which can 
be even sometimes described by right heavy-tailed PDFs, 
representative for extreme flaring states, this new method is 
the most appropriate for the correct establishment of confi- 
dence intervals of a given method e.g. CCF analysis. 

Due to its generality, the method can be employed to 
a vast variety of statistical analysis purposes involving light 
curves obtained across the electromagnetic spectrum for any 
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Figure 21. Bispectrum analysis. [Left panel] The bicoherence of the nonlinear data set, x n \(i) (Fig. H"9l left panel). [Right panel] The 
bicoherence of the shuffled data set, x s a(t) (Fig. Il9l right panel). 



object. The Monte Carlo simulation studies, which are cur- 
rently performed using the TK95 products, can now be ex- 
tended to statistically much more accurate synthetic light 
curves, thus providing us with robust results with respect 
to e.g. cross correlation analysis, establishment of detection 
significance for future missions (e.g. LOFT, CTA), detection 
and characterisation of variability, understudying of the ef- 
fects of irregular sampling. 
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APPENDIX A: DEFINITIONS AND 
NOMENCLATURE 

Below we briefly describe the discrete Fourier transform, the 
calculation of the periodogram, the PSD estimation and the 
derivation of the PDF. 



Al The periodogram 

Consider a light curve x(t) consisting of N equidistant ob- 
servations: {tk,x(tk)} for k — 1,2, . . . , N with a sampling 
period £bin, a mean value of fi and a standard deviation of 
a. The discrete Fourier transform (DF T) of the data set is 
defined following iPress et all (|l992l l TT l: 

JV 

DFT(j) = x(t k )e 2Mk ~ 1)i/N (Al) 

k=l 

yielding TV estimates for j = 0, . . . , N — 1, each one corre- 
sponding to a Fourier frequency fj depending on the parity 
of N (i.e. even or odd): 

At fo = (j = 0) the zero Fourier frequency compo- 
nent, DFT(0), corresponds always to the sum of the light 
curve estimates. 
For even N 

• Positive: f+=j/(Nt bin ) for j=l, N/2-1. 

• Negative: ff=-{N - j)/(Nt hia ) for j=N/2 + 1, . . . , N- 1. 

• Nyquist: f N/2 = / Nyq = l/(2i b in) for j = N/2. 

Note, that the negative frequencies are mirrored versions of 
the positive frequencies with opposite signs (around /Nyq) 

C, S- — f N/2-1 f N/2+1' ' ' ' ' — fl /jV — 1' 

For odd N 

• Positive: f+=j/(Nt Ua ) for j=l, ...,(N- l)/2. 

• Negative: /r=-(jV- j)/(Nt hin ) for j=(N + 1)/2, . . . , N- 
1. 

• Nyquist: There is no Nyquist frequency estimate. 

Note again, that the negative frequencies are mirrored ver- 
sions of the positive frequencies with opposite signs e.g. 

~f(N-l)/2 = /(JV+1J/2' ~fl ~ /jV-1- 

At a given frequency fj, DFT(j) is a complex number 
of the form q + wj 12 knd which carries information about 
the amplitude and the phase of the corresponding sinusoidal 

11 In this case the exponential function contains as a running 
index k — 1 instead of k since the data start for k = 1 and not 
k = 0. 

12 For the case of even N, the DFT(N/2) (i.e. at Nyquist fre- 
quency) is a real number since, from equation lAll the exponential 
function for j = N/2 is equal to 1 (for odd k) or -1 (for even k). 
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component. The amplitude of the sinusoid at a frequency, 
fj , is given by 



u/i = jjy/Be[DFT(j)] 2 + MDFT(j)] 2 
and its phase is given by 



(A2) 



fa = tag[DFT(J)] = ^ct & n{Im[DFT(j)],Re[DFT(j)}} (A3) 

taking values in the closed-open interval (— tt,-k]. For the 
complex number one may use (j> = but formally its phase 
angle is indeterminate. 

The periodogram of x(t) at a given Fourier frequency 
fj, P(fj), is defined as the squared amplitude (equation I A2|) 
of the corresponding sinusoid component 

1 



for j = 0, . . 



N 2 
,7V -1 



5 {Re[DFT(j)] 2 +lm[DFT(j)} 2 } 



(A4) 



Since the light curve consists only of real measurements, 
x(tk) £ R, there is a symmetry between the positive and the 
negative DFT estimates: DFT(j~) = [DFT(j + )]* where j" 
and j + represent the indices for the negative and positive 
frequencies, respectively, and the asterisk denotes complex 
conjugation. Thus, the amplitudes of the corresponding posi- 
tive and negative components are equal and the periodogram 
is estimated as 



P ifi) = Jp {MDFT(j)] 2 +Im[DFT(j)] 2 } 
even TV: j = 0, . . . , N/2 
odd N: j =0,...,(N- l)/2 



(A5) 



with fj — j/(iVt t bm). There is a plethora of normalisa- 
tion factors that ca n be applied to the periodogram (e.g. 
IVauehan et~ai1l2003h . In this work we employ the fractional 
root mean square (rms) normalisation: Nthm/t^ 2 an d the 
periodogram (equation I A5f) becomes 

2£bin 



{Re[DFT(j)\ 2 +Im[DFT(j)] 2 } 



(A6) 



With this normalisation the square root of the integral of the 
underlying PSD between two frequencies /i and ft yields 
the contribution to the fractional rms squared variability 
(i.e. a 2 /fi 2 ). Thus, integration between fi, and /N yq (even) 
or /(„_i)/2 (odd) yields the total rms squared variability. 

A2 Power spectral density estimation 

The 'statistical natural' estimator of the underlying power 
spectral densi ty (PSD), &{J\ is the periodogram, P(f). In 
the manner of lPriestlevI {l981), assume that the light curve, 
Xt, originates from a linear process of the form 



x t 



(A7) 



where et is a purely random Gaussian process and g u is a 
given sequence of constants satisfying ~}2^L g 2 < oo. At a 
given frequency, fj, P(fj) is then asymptotically distributed 
around the lP(fj) as 



i = i, 



N/2-1 (even JV) 
■ ' (JV-l)/2 (odd JV) 

j = N/2 (even TV) 



(A8) 



where xt represents the \ 2 distribution with v degrees of 
freedom (d.o.f.). This means that, for a given frequency, the 
standard deviation of the periodogram estimates is 100 per 
cent, automatically making the ensemble of periodogram es- 
timates an inconsistent estimator of the underlying PSD. 

In order to retrieve the £?{fj), one can use either bin- 
ning or maximum likelihood methodologies. For the former, 
the binned logarithmic period ogram has been proposed by 
IPapadakis fe Lawrence! (|l993l ) ensuring that the logarithmic 
periodogram estimates are normally distributed within each 
geometric mean frequency bin. Thus, PSD models can be fit- 
ted to the logarithmic periodogram estimates using a simple 
least-squares method, requiring Gaussianity within each bin. 
The latter should include at least 10 periodogram estimates 
a fact which partially limits the usefulness of the method 
for small data sets. Another approach is to fit a PSD model 
directly to the ensemble of periodogram estimates by per- 
forming maximum likelihood estimation which makes direct 
use of the underlying distribution at a given Fourier fre- 
quency (equation lA8[l . This is the approach used in this work 
and detailed references can be found in (e.g. | Anderson et al.l 
Il99d : IVaughan|[2005l : iBarret fc Vaughanll2012l )~ 

Consider an underlying PSD model, ^(fj;j), in which 
7 = {71, 72, . . . , 7n} is a vector consisting of the unknown 
model parameters such as normalisation, break/bend fre- 
quency, low/high frequency slopes etc. The probability of 
obtaining a given single periodogram estimate P(fj) for the 
given PSD model & > {fj\*f) is. 



[p(fi)\nfr,i)] = 



-P(fj)/s°(fj;f) 



e -P(fNy q )/g"(/Ny q ;?) 
. [irP(/Ny q )^(/Nyq^)] 1/; 



3 = 1, 



N/2-1 (even JV) 
' ' (JV-1J/2 (odd N) 



(A9) 



j = N/2 (even N) 



The constituent functions of the above piecewise expres- 
sion are usually referred to as 'scaled x 2 distributions' with 
two and one d.o.f. for the top and lower branch respec- 
tively. More precisely, these functions are special forms of the 
gamma distribution, r [v/2, &(f 3 ■; 7)] where v corresponds 
to the d.o.f. i.e. v = 1 corresponds only to the the Nyquist 
frequency, /N yq (j = N/2, even N), and v = 2 to all other 
frequencies (for either even or odd JVF^l 

The joint probability of obtaining the ensemble of peri- 
odogram estimates for the given PSD model is 



JV/2 (even JV) 
(JV-l)/2 (odd JV) 



if 



n 



*i[PUi)\&Uj\D\ 



(A10) 



since asymptotically (i.e. N — ► 00) the various periodogram 
est imates are stric tly independent at the Fourier frequencies 
fj (|Priestlevlll98lh (this is the reason why the periodogram 
is estimated only for these frequencies and not for intermedi- 
ate values). The maximum likelihood estimate of the model 
function parameters, a, is obtained by maximizing the above 
probability, or equivalently by minimizing the log-likelihood 



13 An even more general representation can be obtained 
through the Pearson's Ty pe III distribution (p. 930 in 



i s IV 

lAbramowitz fc Stegunl |l972j) 

p = v/2. 



for a 



0, P = ^(fj;j) and 
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function C — — 21nJzf which is equal to 
C = 

' { 2 E-r- 1 {i n W J ;7)] + ^ y }} + 

« ln[^P(/ Nyq )^(/ Nyq ; 7)] + 2 /ff™^ (even A) (All) 

2Ej:r i)/2 {ln[^(/,;7)] + ^g%} (odd A) 

In this work w e have employed two minimisation routines 
(see for detail s iPress et ai1ll992l): a direct search method, 
Nelder-Mead (|Nelder fc Meadlll965l ) and a stochastic func- 
tion minimizer, simulated annealing l|Kirkpatrick et all 
1 19831 ). The PSD models, <5*(/j;7*), which are usually fitted 
to the data have a power-law form (e.g. broken power-law, 
continuous bending power-law etc.) and, for these type of 
minimisation problems, both methods identical results. 

The joint confidence intervals for q model pa- 
rameters from a total of n components of a, 
{qi, Q2, ■ ■ ■ , a.q, Qg+i, Q-g+2, ■ ■ ■ , ct„ } can be estimated 
us ing the meth od of ICashl l|l979h . based on the theorem 
of IWilksl (|l938l ). Initially, a global minimum is found by 
varying all the n model parameters yielding (C m i n )n- The q 
parameters of interest are fixed to their best-fitting values 
and the rest, q + 1, q + 2, . . . ,n then are varied until a global 
minimum is reached corresponding to (C m i n )n-<j bt • The 
quantity AC = (C m m)n-9 b f — (C m in)n is then distributed as 
a x 2 distribution with q d.o.f. Thus, the 68.3 and 90 per 
cent single confidence intervals for one parameter (q — 1) 
correspond to a AC of 1 and 2.71, respectively. Similarly, 
the 68.3 and 90 per cent joint confidence intervals for one 
parameter (q — 2) correspond to AC of 2.30 and 4.61, 
respectively. In general, for a given confidence interval p, 
and a given value of q the corresponding value of AC is given 
by 2Q~ 1 (v/2,0,p), where Q _1 corresponds to the inverse 
of the generalised regularised incompl ete gamma function 
(defin itions for Q' 1 can be found in IChaudhrv fc Zubairl 
11994 ). 



A3 Probability density function estimation 

The probability density function (PDF) of the observations 
should always be represented by a positive real-valued dis- 
tribution and, depending on the purpose of the statisti- 
cal study, should correspond either to the parent or the 
observed distribution. The PDF is used in the proposed 
method (Sect. [2} to produce a sample of independent and 
identically distributed random variates. 

In general there are three ways to derive the probability 
density function (parent or observed) from a given data set. 
Note, that for the case of the parent distribution we need 
very large data sets to be able to match the overall vari- 
ability profile of the source under study. One approach is 
to fit a probability density function model, f(xi;ff), to the 
histogrammed data (where ff is a vector consisting of the 
unknown distribution's model parameters), using the maxi- 
mum likelihood method in a similar fashion to that described 
above in App. lA2l i.e. maximising the log-likelihood function 
JA In f (xi ; ff) . For pathological cases of histogrammed obser- 
vations exhibiting e.g. highly skewed, non-zero kurtosis in 



conjunction with extreme long-tailed distributions, one can 
use appropriate methodologies developed for su ch purposes, 
such as the method of generalised moments dWooldridgg 
l200lf ). the method of cumu lants (iFriskenl l200ll) and the 
method of factorial moments l|Bialas fc Pesch anski 1983) the 
latter being particularly usef ul in the presence o f low count 
rates i.e. high Poisson noise l|de Wolf et al.lll996l ). 

Another approach is to use directly a piecewise-c onstant 
representation of the unknown PDF (|Knutbl |2006l ). using 
directly the data set consisting of Xi observations (i — 
1,...,N): 

M 

f>(z) = ElcA n (£fc-i>z>&) (A12) 

where Nk is the number of data points in the k th bin, i>k is 
the width of the fc' bin, £k-i and are the edges of the 
fc th bin, and Yl(£ a , x, £p) is the boxcar function being equal 
to 1 for £ a ^ x ^ £/3 and otherwise. 

Finally, instead of using the PDF representation of the 
data set, one can use a cumulative distribution function by 
employing the empirical distribution function of the data 
set consisting of Xi observations (i — 1,...,N). Th is can 
be do ne by estimating the following quantity (after lOwenl 
l200ll) : 

1 N 

Sy(y) = -^^i-^'^v) (A13) 

i=i 

This can then be used directly in order to produce random 
numbers, which is the primary reason that we need to esti- 
mate the distribution of the data points. 



APPENDIX B: STATISTICAL MOMENTS, 
CUMULANTS AND POLYSPECTRA 

In the manner of lPriestlevi l|l98ll) . let A be a random vari- 
able with moment generating function M(t) then the cumu- 
lant generating function, K(t) is defined as: 

K(t) = In [M(t)} (Bl) 

By expanding the above expression in a power series we get: 

K(t) = k 1 t + k 2 ^r + ... + k r ^ (B2) 
2! rl 

The coefficient of t r / (r!) is called the r* cumulant. Only, the 
first three cumulants coincide with the first three statistical 
moments (mean, variance, skewness) and all the other are 
given by more complicated polynomial expressions i.e fci = 
fi, ki — a 2 , k-j, = 71, ki = 72 — 3<7 4 ,etc. 

Generalising the above to more random variables, 
let Xt be a process stationary up to order fc and let 
C(si, S2, ■ ■ ■ , Sk-i) denote the joint cumulant of order k of 
the set of random variables {Xt, Xt+ S1 , ■ ■ ■ , X t + Sk _ 1 } is the 
coefficient of (zi, 22, . . . , Zk) in the expansion of the joint 
cumulant generating function 

K(z!,z 2 , ...,z n )=\n [M(z 1 ,z 2 , z n )] (B3) 

in which M(zi, Z2, ■ ■ ■ , z n ) is the joint moment generating 
function. 

The second order joint cumulant, C*2(si), is simply 
the covariance, cov (X t , X t + S1 ) and the third order joint 
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cumulant, 03(81,83), is identical to the third order joint 
moment,7i(si, S2) (sometimes in economics this is called 
co- skewness and the n ext joint cumulant co-kurtosis e.g. 
iHwang fc SatcheJll999h . 

C 2 (s 1 ) = {(X t -n)(X t+31 - tJ ,)) (B4) 

C 3 (fli, aa ) = ((X t - fi)(X t+ai - fi)(X t+a2 - n)) (B5) 

for si = S2 = these two quantities are directly related to 
the variance and the skewness, C2(0) = a 2 and Cs(0) = 71a 3 
and these are two properties that are mapped on the PDF. 

The Fourier transforms of the correspond- 
ing higher order cumulants are called polyspectra. 

hk{fx, h, ■ ■ ■ , f n ) = 

OO OO 

■■■ £ C( S i,..., Sfc _ 1 ) e - 2 " (/l31+ "- +/fc - lSfe - l) (B6) 

s\ = — 00 s^_^ — — 00 

The second order polyspectrum is the autospectrum and its 
squared amplitude is the PSD, \h 2 (f)\ 2 = &(f). The third 
and the fourth order polyspectra are known as bispectrum 
and trispectrum respectively. These are the quantities that 
characterise the various dependences between the various 
measurements. The fact that two data sets have e.g. the 
same variance and skewness (i.e. 6*2(0), 6*3(0,0)), does 
not mean that they have the same covariance and third 
order joint cumulant, C*2(si) and Cs(si, S2), respectively. 
Thus, data sets which have the same statistical moments 
(i.e. same PDFs), does not mean that they have the same 
polyspectra. 

This paper has been typeset from a TgX/ DTfjX file prepared 
by the author. 
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